In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but ''not'' causally related, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "common response variable", "confounding factor", or "lurking variable").

Examples

An example of a spurious relationship can be found in the time-series literature, where a spurious regression is a regression that provides misleading statistical evidence of a linear relationship between independent non-stationary variables. In fact, the non-stationarity may be due to the presence of a unit root in both variables. In particular, any two nominal economic variables are likely to be correlated with each other, even when neither has a causal effect on the other, because each equals a real variable times the price level, and the common presence of the price level in the two data series imparts correlation to them. (See also spurious correlation of ratios.)

Another example of a spurious relationship can be seen by examining a city's ice cream sales. The sales might be highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Another commonly noted example is a series of Dutch statistics showing a positive correlation between the number of storks nesting in a series of springs and the number of human babies born at that time. Of course there was no causal connection; they were correlated with each other only because they were correlated with the weather nine months before the observations.

In rare cases, a spurious relationship can occur between two completely unrelated variables without any confounding variable, as was the case between the success of the Washington Redskins professional football team in a specific game before each presidential election and the success of the incumbent President's political party in said election. For 16 consecutive elections between 1940 and 2000, the Redskins Rule correctly matched whether the incumbent President's political party would retain or lose the Presidency. The rule failed shortly after the Elias Sports Bureau discovered the correlation in 2000; in 2004, 2012 and 2016, the results of the Redskins game and the election did not match.

In a similar spurious relationship involving the National Football League, in the 1970s, Leonard Koppett noted a correlation between the direction of the stock market and the winning conference of that year's Super Bowl, known as the Super Bowl indicator; the relationship maintained itself for most of the 20th century before reverting to more random behavior in the 21st.
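The spurious regression between independent non-stationary series described at the start of this section can be illustrated with a small simulation. The sketch is not taken from the source: the helper names, the seed, the series length, and the 0.5 threshold are all assumptions chosen for illustration. It generates pairs of independent random walks (each has a unit root) and pairs of independent white-noise series, and counts how often the sample correlation is large in absolute value.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def random_walk(rng, n):
    """Cumulative sum of independent standard normal steps (a unit-root process)."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(rng, n):
    """Independent standard normal draws (a stationary process)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def share_large_correlations(make_series, rng, trials=500, n=200, threshold=0.5):
    """Fraction of trials in which two independent series have |r| > threshold."""
    hits = 0
    for _ in range(trials):
        r = pearson_r(make_series(rng, n), make_series(rng, n))
        if abs(r) > threshold:
            hits += 1
    return hits / trials


rng = random.Random(0)  # arbitrary seed (assumption), used only for reproducibility
walk_share = share_large_correlations(random_walk, rng)
noise_share = share_large_correlations(white_noise, rng)
print(f"independent random walks: share with |r| > 0.5 = {walk_share:.2f}")
print(f"independent white noise:  share with |r| > 0.5 = {noise_share:.2f}")
```

Under these assumptions, large correlations should turn up far more often for the random walks than for the white noise, even though every pair is independent by construction.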

Hypothesis testing

Often one tests a null hypothesis of no correlation between two variables, and chooses in advance to reject the hypothesis if the correlation computed from a data sample would have occurred in less than (say) 5% of data samples if the null hypothesis were true. While a true null hypothesis will be accepted 95% of the time, in the other 5% of cases a true null hypothesis of no correlation will be wrongly rejected, causing acceptance of a correlation which is spurious (an event known as a Type I error). Here the spurious correlation in the sample resulted from random selection of a sample that did not reflect the true properties of the underlying population.
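The 5% figure can be checked with a simulation in which the null hypothesis is true by construction. This is a minimal sketch under stated assumptions: the test uses the Fisher z approximation to the sampling distribution of the correlation coefficient (a swapped-in technique, not one named in the text), and the sample size, number of trials, seed, and function names are arbitrary illustrative choices.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rejects_no_correlation(xs, ys, alpha=0.05):
    """Two-sided test of zero correlation via the Fisher z approximation."""
    n = len(xs)
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


rng = random.Random(1)  # arbitrary seed (assumption)
trials, n = 2000, 50    # illustrative sizes (assumption)
false_positives = 0
for _ in range(trials):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]  # drawn independently of xs
    if rejects_no_correlation(xs, ys):
        false_positives += 1

type_i_rate = false_positives / trials
print(f"share of true nulls rejected at the 5% level: {type_i_rate:.3f}")
```

Every rejection counted here is a Type I error, since the two variables are generated independently; the observed share should be near the chosen significance level.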

Detecting spurious relationships

The term "spurious relationship" is commonly used in statistics and in particular in experimental research techniques, both of which attempt to understand and predict direct causal relationships (X → Y). A non-causal correlation can be spuriously created by an antecedent which causes both (W → X and W → Y). A mediating variable (X → M → Y), if undetected, causes the analysis to estimate a total effect rather than a direct effect, because no adjustment is made for the mediating variable M. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.

Experiments

In experiments, spurious relationships can often be identified by controlling for other factors, including those that have been theoretically identified as possible confounding factors. For example, consider a researcher trying to determine whether a new drug kills bacteria; when the researcher applies the drug to a bacterial culture, the bacteria die. But to help in ruling out the presence of a confounding variable, another culture is subjected to conditions that are as nearly identical as possible to those facing the first-mentioned culture, but the second culture is not subjected to the drug. If there is an unseen confounding factor in those conditions, this control culture will die as well, so that no conclusion of efficacy of the drug can be drawn from the results of the first culture. On the other hand, if the control culture does not die, then the researcher cannot reject the hypothesis that the drug is efficacious.

Non-experimental statistical analyses

Disciplines whose data are mostly non-experimental, such as economics, usually employ observational data to establish causal relationships. The body of statistical techniques used in economics is called econometrics. The main statistical method in econometrics is multivariable regression analysis. Typically a linear relationship such as

$$y = a_0 + a_1x_1 + a_2x_2 + \cdots + a_kx_k + e$$

is hypothesized, in which $y$ is the dependent variable (hypothesized to be the caused variable), $x_j$ for ''j'' = 1, ..., ''k'' is the ''j''th independent variable (hypothesized to be a causative variable), and $e$ is the error term (containing the combined effects of all other causative variables, which must be uncorrelated with the included independent variables). If there is reason to believe that none of the $x_j$ is caused by ''y'', then estimates of the coefficients $a_j$ are obtained. If the null hypothesis that $a_j = 0$ is rejected, then the alternative hypothesis that $a_j \ne 0$, and equivalently that $x_j$ causes ''y'', cannot be rejected. On the other hand, if the null hypothesis that $a_j = 0$ cannot be rejected, then equivalently the hypothesis of no causal effect of $x_j$ on ''y'' cannot be rejected.

Here the notion of causality is one of contributory causality: if the true value $a_j \ne 0$, then a change in $x_j$ will result in a change in ''y'' ''unless'' some other causative variable(s), either included in the regression or implicit in the error term, change in such a way as to exactly offset its effect; thus a change in $x_j$ is ''not sufficient'' to change ''y''. Likewise, a change in $x_j$ is ''not necessary'' to change ''y'', because a change in ''y'' could be caused by something implicit in the error term (or by some other causative explanatory variable included in the model).

Regression analysis controls for other relevant variables by including them as regressors (explanatory variables). This helps to avoid mistaken inference of causality due to the presence of a third, underlying, variable that influences both the potentially causative variable and the potentially caused variable: its effect on the potentially caused variable is captured by directly including it in the regression, so that effect will not be picked up as a spurious effect of the potentially causative variable of interest. In addition, the use of multiple regression helps to avoid wrongly inferring that an indirect effect of, say, ''x''₁ (e.g., ''x''₁ → ''x''₂ → ''y'') is a direct effect (''x''₁ → ''y'').

Just as an experimenter must be careful to employ an experimental design that controls for every confounding factor, so also must the user of multiple regression be careful to control for all confounding factors by including them among the regressors. If a confounding factor is omitted from the regression, its effect is captured in the error term by default, and if the resulting error term is correlated with one (or more) of the included regressors, then the estimated regression may be biased or inconsistent (see omitted variable bias).

In addition to regression analysis, the data can be examined to determine if Granger causality exists. The presence of Granger causality indicates both that ''x'' precedes ''y'', and that ''x'' contains unique information about ''y''.
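The effect of including or omitting a confounding factor among the regressors can be sketched numerically. The data-generating process here is an assumption made purely for illustration: a confounder ''w'' drives both ''x'' and ''y'', while ''x'' has no effect on ''y'' at all. The function names, coefficients, sample size, and seed are likewise hypothetical. Least squares is solved directly from the normal equations on mean-centered data, so only the standard library is needed.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cross_products(us, vs):
    """Sum of products of deviations from the respective means."""
    mu, mv = mean(us), mean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))


def simple_slope(y, x):
    """OLS slope from regressing y on a constant and x."""
    return cross_products(x, y) / cross_products(x, x)


def two_regressor_slopes(y, x1, x2):
    """OLS slopes from regressing y on a constant, x1 and x2 (2x2 normal equations)."""
    s11 = cross_products(x1, x1)
    s22 = cross_products(x2, x2)
    s12 = cross_products(x1, x2)
    s1y = cross_products(x1, y)
    s2y = cross_products(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


rng = random.Random(2)  # arbitrary seed (assumption)
n = 5000
w = [rng.gauss(0.0, 1.0) for _ in range(n)]        # confounder
x = [wi + rng.gauss(0.0, 1.0) for wi in w]         # caused by w only
y = [2.0 * wi + rng.gauss(0.0, 1.0) for wi in w]   # caused by w only, not by x

naive_slope = simple_slope(y, x)           # w omitted: its effect loads onto x
b_x, b_w = two_regressor_slopes(y, x, w)   # w included as a regressor
print(f"slope on x with w omitted:  {naive_slope:.3f}")
print(f"slope on x with w included: {b_x:.3f}")
print(f"slope on w with w included: {b_w:.3f}")
```

In this setup the population value of the naive slope is cov(''x'', ''y'') / var(''x'') = 2 / 2 = 1, a purely spurious effect; once ''w'' is included, the estimated coefficient on ''x'' should fall close to zero and the coefficient on ''w'' close to 2.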

Other relationships

There are several other relationships defined in statistical analysis as follows.
* Direct relationship
* Mediating relationship
* Moderating relationship


External links

* Spurious correlations, a collection of examples